text-mining-group-work

Vladyslav Gorbunov, Larissa Pagliarin

2024-06-02

Data Import and Preparation

We import the dataset SP500_data.csv and create a copy named data to work with, so that the original dataset is not changed.
We use several libraries to carry out the tasks and produce the required output.


Data Exploration

This section gives a concise overview of the tweets from the social media accounts of Swiss universities.
The dataset consists of 19,575 observations and 14 variables:

Time Range and Tweet Frequency:

  • Tweets range from September 29, 2009, to January 26, 2023, which indicates long-term use of Twitter
  • The median tweet date is April 13, 2018, so half of the tweets were posted after this date; since this is later than the midpoint of the time range, the data is skewed towards recent years

Retweet and Favorite Counts:

  • Retweets range from 0 to 267 and likes from 0 to 188 per tweet
  • The first quartile is 0 for both retweets and likes, and the median is 1 for retweets and 0 for likes, indicating that many tweets receive little to no engagement
  • The in_reply_to_screen_name field shows that some tweets are replies to other users, which might indicate engagement or conversation strategies used by the universities
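
The quartile statements above can be reproduced with base R. The following is a minimal sketch on a small made-up vector (toy_retweets is hypothetical and stands in for tweets$retweet_count); it shows how the quartiles and the share of tweets without any retweets might be computed.

```r
# Hypothetical toy data standing in for tweets$retweet_count
toy_retweets <- c(0, 0, 0, 1, 1, 2, 2, 5, 267)

# First quartile, median and third quartile
retweet_quartiles <- quantile(toy_retweets, probs = c(0.25, 0.5, 0.75))

# Share of tweets that received no retweets at all
share_zero <- mean(toy_retweets == 0)

print(retweet_quartiles)
print(share_zero)
```

A single very large value such as 267 barely moves the quartiles, which is why they describe typical engagement better than the maximum does.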

ID and String Variables:
  • The id and id_str fields are technical identifiers for tweets; their wide range of values reflects that the tweets were collected over a long period

Language and University Fields:

  • lang shows the language of each tweet
  • university shows the abbreviation of the university

Temporal Patterns:

  • created_at, tweet_date, tweet_hour, and tweet_month provide detailed temporal data
  • These fields can be analyzed to understand peak times of activity and seasonal or monthly trends in tweeting behavior
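
The code that derived the temporal fields is not part of this report, so the following base R sketch is only one possible way to build comparable fields from created_at. The two timestamps are taken from the first rows of the dataset; the UTC time zone and the exact output types are assumptions.

```r
# Two timestamps from the first rows of the dataset; UTC is an assumption
created_at <- as.POSIXct(c("2023-01-20 17:17:32", "2023-01-13 07:52:01"), tz = "UTC")

# Calendar day of each tweet
tweet_date <- as.Date(created_at, tz = "UTC")

# First day of the month, useful for monthly aggregation
tweet_month <- as.Date(format(created_at, "%Y-%m-01", tz = "UTC"))

# Hour of day as a two-digit string, useful for finding peak posting times
timeofday_hour <- format(created_at, "%H", tz = "UTC")

# Number of tweets per hour of day
print(table(timeofday_hour))
```

Tabulating timeofday_hour over all tweets would give the distribution of posting times across the day.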

Content Analysis

The word cloud represents the most frequently used words in the filtered tweets with high engagement (likes or retweets). Key observations include:

Frequent Terms: Larger words such as “bachelor,” “design,” “die,” “das,” “der,” and “amp” occur more often.

Key Topics:

  • “bachelor” for Bachelor’s programs or graduates
  • “design” for design courses or projects
  • “HSLU” (Hochschule Luzern)

General terms: “schweiz,” “zeigen,” “nicht.”

Note: The term “amp” appears due to HTML encoding and is not meaningful.


# Display first six rows of 'tweets'
head(tweets)
## # A tibble: 6 × 14
##   created_at               id id_str            full_text in_reply_to_screen_n…¹
##   <dttm>                <dbl> <chr>             <chr>     <chr>                 
## 1 2023-01-20 17:17:32 1.62e18 1616469988369469… "Im MSc … <NA>                  
## 2 2023-01-13 07:52:01 1.61e18 1613790954737074… "Was bew… <NA>                  
## 3 2023-01-12 19:30:01 1.61e18 1613604227141537… "Was uns… <NA>                  
## 4 2023-01-12 08:23:00 1.61e18 1613436367169634… "Eine di… <NA>                  
## 5 2023-01-11 14:00:05 1.61e18 1613158809081450… "Wir gra… <NA>                  
## 6 2023-01-10 17:06:11 1.61e18 1612843252083834… "Unsere … <NA>                  
## # ℹ abbreviated name: ¹​in_reply_to_screen_name
## # ℹ 9 more variables: retweet_count <int>, favorite_count <int>, lang <chr>,
## #   university <chr>, tweet_date <dttm>, tweet_minute <dttm>,
## #   tweet_hour <dttm>, tweet_month <date>, timeofday_hour <chr>
# Provide summary statistics
summary(tweets)
##    created_at                          id                     
##  Min.   :2009-09-29 14:29:47.0   Min.   :         4468752018  
##  1st Qu.:2015-01-28 15:07:41.5   1st Qu.: 560439073866000000  
##  Median :2018-04-13 13:26:56.0   Median : 984754806702000000  
##  Mean   :2017-12-09 15:26:50.7   Mean   : 939953703992000000  
##  3rd Qu.:2020-10-20 10:34:50.0   3rd Qu.:1318470720360000000  
##  Max.   :2023-01-26 14:49:31.0   Max.   :1618607065240000000  
##     id_str           full_text         in_reply_to_screen_name
##  Length:19575       Length:19575       Length:19575           
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##  retweet_count     favorite_count       lang            university       
##  Min.   :  0.000   Min.   :  0.00   Length:19575       Length:19575      
##  1st Qu.:  0.000   1st Qu.:  0.00   Class :character   Class :character  
##  Median :  1.000   Median :  0.00   Mode  :character   Mode  :character  
##  Mean   :  1.289   Mean   :  1.37                                        
##  3rd Qu.:  2.000   3rd Qu.:  2.00                                        
##  Max.   :267.000   Max.   :188.00                                        
##    tweet_date                      tweet_minute                   
##  Min.   :2009-09-29 00:00:00.00   Min.   :2009-09-29 14:29:00.00  
##  1st Qu.:2015-01-28 00:00:00.00   1st Qu.:2015-01-28 15:07:00.00  
##  Median :2018-04-13 00:00:00.00   Median :2018-04-13 13:26:00.00  
##  Mean   :2017-12-09 02:25:45.00   Mean   :2017-12-09 15:26:24.68  
##  3rd Qu.:2020-10-20 00:00:00.00   3rd Qu.:2020-10-20 10:34:30.00  
##  Max.   :2023-01-26 00:00:00.00   Max.   :2023-01-26 14:49:00.00  
##    tweet_hour                      tweet_month         timeofday_hour    
##  Min.   :2009-09-29 14:00:00.00   Min.   :2009-09-01   Length:19575      
##  1st Qu.:2015-01-28 14:30:00.00   1st Qu.:2015-01-01   Class :character  
##  Median :2018-04-13 13:00:00.00   Median :2018-04-01   Mode  :character  
##  Mean   :2017-12-09 14:59:43.81   Mean   :2017-11-24                     
##  3rd Qu.:2020-10-20 10:00:00.00   3rd Qu.:2020-10-01                     
##  Max.   :2023-01-26 14:00:00.00   Max.   :2023-01-01

Data Manipulation

Languages

Here we calculate the frequency of each language present in the tweets dataset and sort these frequencies in descending order.
The output indicates that German (de) is the most common language with 14,474 occurrences, followed by Italian (it) with 1,865 and French (fr) with 1,792. English (en) comes next with 1,280 tweets. The frequencies of other languages, including rare and less commonly used ones, are also listed, showcasing the linguistic diversity in the dataset.

# Count the frequency of each language
lang_counts <- table(tweets$lang)

# Sort the language frequencies in descending order
sort(lang_counts, decreasing = TRUE)
## 
##    de    it    fr    en   qam   qme    es    ca    da    ro    nl    in    et 
## 14474  1865  1792  1280    35    21    19    10    10    10     9     7     6 
##   und    pt   zxx   art    lv    cy    fi    lt    no   qht    cs    eu    ht 
##     6     4     4     3     3     2     2     2     2     2     1     1     1 
##    ja    sv    tl    tr 
##     1     1     1     1


German, Italian, French and English are by far the most frequent languages. The remaining language codes occur only rarely and do not belong to the most spoken languages in Switzerland, so we limit the dataset to these four languages.

# Filter the DataFrame to keep only tweets in German, Italian, French and English
filtered_tweets <- tweets[tweets$lang %in% c("de", "it", "fr", "en"), ]

# Check the resulting language distribution
table(filtered_tweets$lang)
## 
##    de    en    fr    it 
## 14474  1280  1792  1865
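
As a quick plausibility check of the filter, the counts from the language table can be summed by hand. The sketch below only re-uses numbers printed in this report; the variable names are illustrative.

```r
# Counts copied from the table(tweets$lang) output
lang_counts_top4 <- c(de = 14474, it = 1865, fr = 1792, en = 1280)
total_tweets <- 19575  # number of observations before filtering

kept <- sum(lang_counts_top4)      # tweets that survive the filter
removed <- total_tweets - kept     # tweets in all other language codes
share_kept <- kept / total_tweets  # proportion of the data retained

# Share of each of the four languages in the full dataset, in percent
print(round(100 * lang_counts_top4 / total_tweets, 1))
print(c(kept = kept, removed = removed))
```

The filter keeps 19,411 tweets and drops 164, so more than 99% of the data is retained.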


This gives us the new summary of the dataset:

  • Number of Records: The total count of tweets has decreased from 19,575 to 19,411, so 164 tweets in other languages have been filtered out.
  • Date and Time: The median and mean values change only minimally; for example, the median of created_at moves from April 13, 2018 to April 17, 2018.
  • Other Attributes: No significant changes are observed in the ranges.
# Provide summary statistics
summary(filtered_tweets)
##    created_at                           id                     
##  Min.   :2009-09-29 14:29:47.00   Min.   :         4468752018  
##  1st Qu.:2015-02-04 11:39:32.00   1st Qu.: 562923403041000000  
##  Median :2018-04-17 13:53:07.00   Median : 986210946744999936  
##  Mean   :2017-12-11 15:27:49.55   Mean   : 940675313339000064  
##  3rd Qu.:2020-10-20 11:09:15.50   3rd Qu.:1318479385120000000  
##  Max.   :2023-01-26 14:49:31.00   Max.   :1618607065240000000  
##     id_str           full_text         in_reply_to_screen_name
##  Length:19411       Length:19411       Length:19411           
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##  retweet_count     favorite_count        lang            university       
##  Min.   :  0.000   Min.   :  0.000   Length:19411       Length:19411      
##  1st Qu.:  0.000   1st Qu.:  0.000   Class :character   Class :character  
##  Median :  1.000   Median :  0.000   Mode  :character   Mode  :character  
##  Mean   :  1.293   Mean   :  1.376                                        
##  3rd Qu.:  2.000   3rd Qu.:  2.000                                        
##  Max.   :267.000   Max.   :188.000                                        
##    tweet_date                     tweet_minute                   
##  Min.   :2009-09-29 00:00:00.0   Min.   :2009-09-29 14:29:00.00  
##  1st Qu.:2015-02-04 00:00:00.0   1st Qu.:2015-02-04 11:39:00.00  
##  Median :2018-04-17 00:00:00.0   Median :2018-04-17 13:53:00.00  
##  Mean   :2017-12-11 02:26:53.7   Mean   :2017-12-11 15:27:23.56  
##  3rd Qu.:2020-10-20 00:00:00.0   3rd Qu.:2020-10-20 11:09:00.00  
##  Max.   :2023-01-26 00:00:00.0   Max.   :2023-01-26 14:49:00.00  
##    tweet_hour                      tweet_month         timeofday_hour    
##  Min.   :2009-09-29 14:00:00.00   Min.   :2009-09-01   Length:19411      
##  1st Qu.:2015-02-04 11:30:00.00   1st Qu.:2015-02-01   Class :character  
##  Median :2018-04-17 13:00:00.00   Median :2018-04-01   Mode  :character  
##  Mean   :2017-12-11 15:00:42.28   Mean   :2017-11-26                     
##  3rd Qu.:2020-10-20 10:30:00.00   3rd Qu.:2020-10-01                     
##  Max.   :2023-01-26 14:00:00.00   Max.   :2023-01-01

Emojis

The package emo is used for emoji analysis in R, which is essential for text data that includes emojis. This is useful for cleaning data, extracting information, or preparing text for further analysis.
Understanding the prevalence of emojis can help analyze sentiment, user engagement, or cultural trends in social media data.

# Install the emo package from GitHub for emoji analysis (requires the remotes package)
if (!require("emo")) {
  remotes::install_github("hadley/emo")
}
## Loading required package: emo
library(emo)

Text Preprocessing

We create a text corpus from filtered_tweets$clean_text, where each tweet is treated as a separate document.
The corpus serves as the foundational structure for text analysis, allowing for uniform processing and manipulation of the text data.

# Corpus: a collection of text documents that serves as the basis for text processing and text mining.
# VectorSource(filtered_tweets$clean_text): the character vector is used as the source for the corpus, so each entry becomes a separate document.
# Only the text column is passed on, because the corpus should contain text data only.
corpus <- Corpus(VectorSource(filtered_tweets$clean_text))


Here we clean the corpus by converting all text to lowercase, removing punctuation, numbers, and German, French, Italian, and English stop words, stripping extra whitespace, and finally stemming the remaining words.
Cleaning the text is crucial for reducing noise and focusing analyses on meaningful words only. This standardizes the text data, making subsequent analyses like topic modeling or sentiment analysis more effective and less prone to error due to textual inconsistencies.

# Clean text
corpus <- tm_map(corpus, content_transformer(tolower))      # Convert to lower case
corpus <- tm_map(corpus, removePunctuation)                 # Remove punctuation marks
corpus <- tm_map(corpus, removeNumbers)                     # Remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("german"))  # Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords("french"))
corpus <- tm_map(corpus, removeWords, stopwords("italian"))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)                   # Collapse repeated whitespace
corpus <- tm_map(corpus, stemDocument)                      # Reduce words to their stem (English stemmer by default)

# Further clean the text by removing specific web/text symbols and terms
corpus <- tm_map(corpus, content_transformer(function(x) {
  x <- gsub("–", "", x)
  x <- gsub("…", "", x)
  x <- gsub("«", "", x)
  x <- gsub("»", "", x)
  x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE)  # Remove the stand-alone tokens 'rt', 'www', and 'emojiemoji'
  # Note: this pattern also removes 'amp' inside longer words (e.g. 'campus' becomes 'cus');
  # "\\bamp\\b" would restrict it to the stand-alone token left over from '&amp;'
  x <- gsub("amp", "", x, ignore.case = TRUE)
  # Note: punctuation was already removed above, so "://" no longer occurs and this pattern
  # cannot match; URL removal would have to run before removePunctuation to take effect
  x <- gsub("http[s]?://\\S+", "", x)
  return(x)
}))
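
The order of the cleaning steps matters: a URL pattern only matches while "://" is still present, and an unbounded pattern such as "amp" also cuts into longer words. The following base R sketch illustrates one possible ordering on a made-up tweet. The function clean_text_sketch and the example text are hypothetical; this is not the pipeline that produced the results in this report.

```r
# Hypothetical example tweet, not taken from the dataset
raw_text <- "RT Neuer Campus &amp; Lab: https://t.co/abc123 #HSLU"

clean_text_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("http[s]?://\\S+", " ", x)               # remove URLs while "://" is still present
  x <- gsub("&amp;", " ", x, fixed = TRUE)           # drop the HTML entity for "&"
  x <- gsub("\\b(rt|www)\\b", " ", x, perl = TRUE)   # remove stand-alone 'rt' and 'www'
  x <- gsub("[[:punct:]]", " ", x)                   # remove remaining punctuation
  x <- gsub("[[:digit:]]", " ", x)                   # remove numbers
  x <- gsub("\\s+", " ", x)                          # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_text_sketch(raw_text)
print(cleaned)
```

With this ordering the word "campus" stays intact and no URL fragments are left behind.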


Here we create a Document-Term Matrix (DTM) from the corpus, applying additional filters such as punctuation removal and stop-word exclusion while the matrix is built. We then remove terms that appear in fewer than 1% of the documents to reduce sparsity.
Reducing sparsity helps focus on terms that have significant presence across documents, enhancing the reliability and performance of statistical models and algorithms applied later.

# Create DTM and remove sparse terms
dtm1 <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm1 <- removeSparseTerms(dtm1, sparse = 0.99)  # Adjust sparsity threshold as needed
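
To make the sparsity threshold concrete, the idea can be reproduced on a tiny matrix in base R. This is a sketch of the principle only, not of the tm internals; toy_dtm and the threshold of 0.5 are made up for illustration.

```r
# Hypothetical toy document-term matrix: 4 documents (rows) x 3 terms (columns)
toy_dtm <- rbind(
  doc1 = c(a = 1, b = 0, c = 2),
  doc2 = c(a = 1, b = 0, c = 0),
  doc3 = c(a = 3, b = 1, c = 0),
  doc4 = c(a = 0, b = 0, c = 0)
)

sparse <- 0.5  # illustrative value; the analysis above uses 0.99

# Number of documents in which each term occurs at least once
doc_freq <- colSums(toy_dtm > 0)

# Keep terms that occur in more than (1 - sparse) of all documents
keep <- doc_freq > nrow(toy_dtm) * (1 - sparse)
toy_dtm_reduced <- toy_dtm[, keep, drop = FALSE]

print(colnames(toy_dtm_reduced))
```

With sparse = 0.99 and 19,411 documents, the same rule keeps only terms that occur in more than about 194 tweets.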

Tweet Analysis

Frequency

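findFreqTerms returns a vector of terms that meet the specified frequency threshold. In this case, terms such as “schweizer”, “bfh”, “neuen”, “emoji”, and others are listed, indicating they are common within the dataset. Setting a high frequency threshold (e.g., 25 occurrences) focuses the analysis on terms that are relevant across the dataset. A few entries are likely preprocessing artifacts rather than real words: “cus” is what remains of “campus” once the string “amp” is removed, and “http” is a leftover from URLs.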

# Check the term frequencies
findFreqTerms(dtm1, lowfreq = 25)  # Shows terms that occur at least 25 times
##  [1] "dank"            "innov"           "schweizer"       "unternehmen"    
##  [5] "arbeit"          "dass"            "zeigt"           "bfh"            
##  [9] "neuen"           "unser"           "digital"         "entwickelt"     
## [13] "zukunft"         "emoji"           "startup"         "mehr"           
## [17] "neue"            "gibt"            "info"            "berner"         
## [21] "blogbeitrag"     "statt"           "sozial"          "beim"           
## [25] "menschen"        "projekt"         "bern"            "geht"           
## [29] "team"            "schweiz"         "dabei"           "forschung"      
## [33] "studi"           "bfhhesb"         "cus"             "informatik"     
## [37] "technik"         "studierenden"    "depart"          "anmelden"       
## [41] "onlin"           "uhr"             "thema"           "heut"           
## [45] "busi"            "ab"              "wurd"            "zwei"           
## [49] "swiss"           "bachelor"        "digit"           "interview"      
## [53] "immer"           "digitalisierung" "findet"          "institut"       
## [57] "morgen"          "zhaw"            "dr"              "herzlich"       
## [61] "jahr"            "erst"            "hochschul"       "kunst"          
## [65] "rahmen"          "fachhochschul"   "mai"             "srf"            
## [69] "wünschen"        "studierend"      "master"          "scienc"         
## [73] "student"         "design"          "fhnw"            "prof"           
## [77] "hsafhnw"         "manag"           "via"             "studium"        
## [81] "luzern"          "hslu"            "social"          "fhnwtechnik"    
## [85] "htwchur"         "cc"              "hesso"           "infoanlass"     
## [89] "http"            "projet"          "chur"            "htw"            
## [93] "engineeringzhaw" "fhnwbusi"        "supsi"           "graubünden"


Words like “schweizer” (Swiss), “unternehmen” (companies), “zukunft” (future), “innov” (innovation), and “digital” suggest that the text data heavily revolves around themes of Swiss companies, innovation, and digital advancements.
The frequent appearance of terms like “dank” (thanks), “neue” (new), “mehr” (more), and “info” indicates common communication patterns, possibly related to news dissemination or updates about new developments and initiatives.

set.seed(123)
# Term frequencies: terms are the columns of a DTM, so sum over columns
word_freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
top_word_freq1 <- head(word_freq1, 80)
word_names1 <- names(top_word_freq1)  # Names in the same order as the frequencies

# Generate word cloud using the matching word names
wordcloud(
  words = word_names1, 
  freq = top_word_freq1, 
  max.words = 80,
  scale = c(4, 0.5),       # Control for size of the most and least frequent words
  random.order = FALSE,    # Higher frequency words appear first
  rot.per = 0.25,          # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2")  # Enhances visual appeal
)
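
In a document-term matrix the rows are documents and the columns are terms, so term frequencies come from column sums, while row sums give the number of counted terms per document. A small base R sketch with made-up counts (toy_dtm is hypothetical) shows the difference:

```r
# Hypothetical toy document-term matrix: 4 documents x 3 terms
toy_dtm <- rbind(
  doc1 = c(bachelor = 1, design = 0, hslu = 2),
  doc2 = c(bachelor = 1, design = 0, hslu = 0),
  doc3 = c(bachelor = 3, design = 1, hslu = 0),
  doc4 = c(bachelor = 0, design = 0, hslu = 0)
)

# Term frequencies: one value per term (column)
term_freq <- sort(colSums(toy_dtm), decreasing = TRUE)

# Top terms with names and frequencies kept in the same order
top_terms <- head(term_freq, 2)
cloud_input <- data.frame(word = names(top_terms), freq = unname(top_terms))

# Document lengths: one value per document (row), not a term frequency
doc_lengths <- rowSums(toy_dtm)

print(cloud_input)
print(doc_lengths)
```

Taking the words from names(top_terms) keeps each word paired with its own frequency when both are passed to a plotting function.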

# Code to analyze tweet frequencies by time and university
p1<- filtered_tweets %>%
  mutate(tweet_month = floor_date(created_at, "month")) %>%
  group_by(university, tweet_month) %>%
  summarize(count = n(), .groups = 'drop') %>%
  ggplot(aes(x = tweet_month, y = count, fill = university)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Monthly Tweet Frequency by University", x = "Month", y = "Number of Tweets")

# Convert to interactive plotly object; the tooltip shows the mapped aesthetics
interactive_plot <- ggplotly(p1, tooltip = c("x", "y", "fill"))

# Optionally, add configurations to enhance interaction
interactive_plot <- interactive_plot %>% layout(
  hovermode = 'closest',
  title = "Click on a University to see its Tweet Trends",
  showlegend = TRUE
)

interactive_plot

High Engagement

This section sets a variable engagement_threshold to 20, which is used as the minimum number of likes or retweets a tweet must have to be considered as having “high engagement”. This threshold helps to focus on tweets that have garnered more attention and interaction.

# Set a threshold for "high engagement" (e.g., tweets with at least 20 likes or retweets)
engagement_threshold <- 20

# Filter tweets based on this engagement threshold
high_engagement_tweets <- filtered_tweets %>%
  filter(favorite_count >= engagement_threshold | retweet_count >= engagement_threshold)
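
How many tweets count as "high engagement" depends directly on the chosen threshold. The following base R sketch uses made-up counts (toy_tweets and count_high_engagement are hypothetical) to show how the effect of different thresholds might be compared before settling on one.

```r
# Hypothetical toy data with the two engagement columns used above
toy_tweets <- data.frame(
  favorite_count = c(0, 25, 3, 12, 0),
  retweet_count  = c(1, 2, 40, 9, 0)
)

# Count tweets where likes OR retweets reach a given threshold
count_high_engagement <- function(df, threshold) {
  sum(df$favorite_count >= threshold | df$retweet_count >= threshold)
}

# Compare a few candidate thresholds
thresholds <- c(5, 10, 20)
counts <- sapply(thresholds, function(t) count_high_engagement(toy_tweets, t))
names(counts) <- thresholds

print(counts)
```

Because the two conditions are combined with OR, a tweet qualifies as soon as either its likes or its retweets reach the threshold.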


For the high_engagement_tweets we apply the same cleaning steps: converting all text to lowercase, removing punctuation, numbers, and German, French, Italian, and English stop words, stripping extra whitespace, and stemming. We then create a Document-Term Matrix (DTM) from this corpus.

# Rebuild the corpus from the high-engagement tweets
corpus2 <- Corpus(VectorSource(high_engagement_tweets$clean_text)) 

corpus2 <- tm_map(corpus2, content_transformer(tolower))  # Convert to lower case
corpus2 <- tm_map(corpus2, removePunctuation)             # Removing punctuation marks
corpus2 <- tm_map(corpus2, removeNumbers)                 # Removing numbers
corpus2 <- tm_map(corpus2, removeWords, stopwords("german"))  # Removing stop words
corpus2 <- tm_map(corpus2, removeWords, stopwords("french"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("italian"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
corpus2 <- tm_map(corpus2, stripWhitespace)               # Removal of additional spaces
corpus2 <- tm_map(corpus2, stemDocument) #remove suffixes, etc.; only root form of the word

# Further clean the text by removing specific web/text symbols and terms
corpus2 <- tm_map(corpus2, content_transformer(function(x) {
  x <- gsub("–", "", x)
  x <- gsub("…", "", x) 
  x <- gsub("«", "", x) 
  x <- gsub("»", "", x) 
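  x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE)  # Remove the stand-alone tokens 'rt', 'www', and 'emojiemoji'
  # Note: this pattern also removes 'amp' inside longer words (e.g. 'campus' becomes 'cus')
  x <- gsub("amp", "", x, ignore.case = TRUE)
  # Note: punctuation was already removed above, so this URL pattern cannot match at this point
  x <- gsub("http[s]?://\\S+", "", x)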
  return(x)
}))

# Create DTM and remove sparse terms
dtm <- DocumentTermMatrix(corpus2, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm <- removeSparseTerms(dtm, sparse = 0.99)  # Adjust sparsity threshold as needed


The word cloud illustrates which topics are most engaging among tweets with at least 20 likes or retweets. This visualization can help refine communication and engagement strategies by focusing on the topics that naturally engage the audience.

  • Words like “digital,” “Data,” and “Open” point to a strong focus on digital innovation and open data or technology. This suggests that tweets discussing digital technologies or data transparency tend to receive higher engagement.
  • Terms such as “forscherteam” (research team), “univers” (university), and “lab” indicate that content related to academic research or laboratory work resonates well with the audience. This could be within a university setting or tech-related academic research.
  • Words like “revolutionieren” (revolutionize), “entwickelt” (developed), and “chanc” (chances) suggest that discussions around innovation and development are highly engaging.
  • “Gespräch” (conversation/discussion) indicates that interactive or discussion-based tweets, perhaps those inviting comments or thoughts from the community, are among those that receive more likes and retweets.
  • Words like “mithilf” (with the help of) point to collaboration and community as a further theme.
set.seed(123)
# Term frequencies: terms are the columns of a DTM, so sum over columns
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
top_word_freq <- head(word_freq, 80)
word_names <- names(top_word_freq)  # Names in the same order as the frequencies

# Generate word cloud using the matching word names
wordcloud(
  words = word_names, 
  freq = top_word_freq, 
  max.words = 80,
  scale = c(4, 0.5),       # Control for size of the most and least frequent words
  random.order = FALSE,    # Higher frequency words appear first
  rot.per = 0.25,          # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2")  # Enhances visual appeal
)

# Analyze the frequency of different emojis
emoji_freq1 <- table(unlist(high_engagement_tweets$emojis))
sort(emoji_freq1, decreasing = TRUE)
## 
##  ➡️ 🇨🇭  ⤵️ ✨ 🇨🇳 🇬🇧 🇳🇱 🇸🇪 🇸🇬 👉 💛 📅 📢  🗞️ 😀 😉 🚊 🚨 
##  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

Engagement Analysis by University

The bar chart visualizes the total likes accumulated by each university among tweets with at least 20 likes or retweets, highlighting variations in engagement across these institutions on social media.
The visualization shows which universities receive the most engagement in terms of likes. HSLU (Lucerne University of Applied Sciences and Arts) and ZHAW (Zurich University of Applied Sciences) stand out with the highest engagement, significantly more than other institutions. Their approach may offer a pathway for other institutions to refine their social media tactics.

# Analysis of likes and retweets
high_engagement_tweets %>%
  group_by(university) %>%
  summarize(total_likes = sum(favorite_count), total_retweets = sum(retweet_count), .groups = 'drop') %>%
  ggplot(aes(x = reorder(university, total_likes), y = total_likes)) +
  geom_col() +
  coord_flip() +
  labs(title = "Engagement Analysis by University", x = "University", y = "Total Likes")
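
The per-university totals behind the chart can also be inspected as a plain table. The sketch below uses base R aggregate on made-up numbers (toy_tweets and all counts are hypothetical) and reports retweets alongside likes.

```r
# Hypothetical toy data; the counts are invented for illustration
toy_tweets <- data.frame(
  university     = c("HSLU", "HSLU", "ZHAW", "BFH"),
  favorite_count = c(30, 22, 41, 20),
  retweet_count  = c(5, 21, 10, 3)
)

# Sum likes and retweets per university
totals <- aggregate(
  cbind(favorite_count, retweet_count) ~ university,
  data = toy_tweets,
  FUN = sum
)

# Order by total likes, highest first
totals <- totals[order(-totals$favorite_count), ]

print(totals)
```

Totals favor accounts with many qualifying tweets, so the number of tweets per university is worth keeping in mind when reading the ranking.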

HSLU & ZHAW Engagement Analysis

# Filter tweets for HSLU and ZHAW
hslu_zhaw_tweets <- filtered_tweets %>%
  filter(university %in% c("HSLU", "ZHAW"))

# Set a threshold for "high engagement" (e.g., tweets with at least 10 likes or retweets)
engagement_threshold1 <- 10

# Filter tweets based on this engagement threshold
hslu_zhaw_high_engagement_tweets <- hslu_zhaw_tweets %>%
  filter(favorite_count >= engagement_threshold1 | retweet_count >= engagement_threshold1)

# Rebuild the corpus from the HSLU and ZHAW high-engagement tweets
corpus3 <- Corpus(VectorSource(hslu_zhaw_high_engagement_tweets$clean_text)) 

corpus3 <- tm_map(corpus3, content_transformer(tolower))  # Convert to lower case
corpus3 <- tm_map(corpus3, removePunctuation)             # Removing punctuation marks
corpus3 <- tm_map(corpus3, removeNumbers)                 # Removing numbers
corpus3 <- tm_map(corpus3, removeWords, stopwords("german"))  # Removing stop words
corpus3 <- tm_map(corpus3, removeWords, stopwords("french"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("italian"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("english"))
corpus3 <- tm_map(corpus3, stripWhitespace)               # Removal of additional spaces
corpus3 <- tm_map(corpus3, stemDocument) #remove suffixes, etc.; only root form of the word

# Further clean the text by removing specific web/text symbols and terms
corpus3 <- tm_map(corpus3, content_transformer(function(x) {
  x <- gsub("–", "", x)
  x <- gsub("…", "", x) 
  x <- gsub("«", "", x) 
  x <- gsub("»", "", x) 
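  x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE)  # Remove the stand-alone tokens 'rt', 'www', and 'emojiemoji'
  # Note: this pattern also removes 'amp' inside longer words (e.g. 'campus' becomes 'cus')
  x <- gsub("amp", "", x, ignore.case = TRUE)
  # Note: punctuation was already removed above, so this URL pattern cannot match at this point
  x <- gsub("http[s]?://\\S+", "", x)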
  return(x)
}))

# Create DTM and remove sparse terms
dtm2 <- DocumentTermMatrix(corpus3, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm2 <- removeSparseTerms(dtm2, sparse = 0.99)  # Adjust sparsity threshold as needed
set.seed(123)
# Term frequencies: terms are the columns of a DTM, so sum over columns
word_freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
top_word_freq2 <- head(word_freq2, 80)
word_names2 <- names(top_word_freq2)  # Names in the same order as the frequencies

# Generate word cloud using the matching word names
wordcloud(
  words = word_names2, 
  freq = top_word_freq2, 
  max.words = 80,
  scale = c(4, 0.5),       # Control for size of the most and least frequent words
  random.order = FALSE,    # Higher frequency words appear first
  rot.per = 0.25,          # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2")  # Enhances visual appeal
)

# Analyze the frequency of different emojis
emoji_freq <- table(unlist(hslu_zhaw_high_engagement_tweets$emojis))
sort(emoji_freq, decreasing = TRUE)
## 
## 👉  ➡️  ⚖️ 🇨🇭 🌍 🌳 👋 💛 💬 💻 📈 📰 😀 🤔 🥑 🥗 
##  4  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1

Recommendations

Conclusion